HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data

Mediation analysis plays a major role in identifying significant mediators in the pathway between environmental exposures and health outcomes. With advanced data collection technology for large-scale studies, there has been growing research interest in developing methodology for high-dimensional mediation analysis. In this paper we present HIMA2, an extension of the HIMA method (Zhang in Bioinformatics 32:3150–3154, 2016). First, the proposed HIMA2 reduces the dimension of mediators to a manageable level based on the sure independence screening (SIS) method (Fan in J R Stat Soc Ser B 70:849–911, 2008). Second, a de-biased Lasso procedure is implemented for estimating regression parameters. Third, we use a multiple-testing procedure to accurately control the false discovery rate (FDR) when testing high-dimensional mediation hypotheses. We demonstrate its practical performance using Monte Carlo simulation studies and apply our method to identify DNA methylation markers which mediate the pathway from smoking to reduced lung function in the Coronary Artery Risk Development in Young Adults (CARDIA) Study. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04748-1.

Our motivating example comes from the DNA methylation (DNAm) research of the Coronary Artery Risk Development in Young Adults (CARDIA) Study [23]. In the DNAm process, methyl groups are added to DNA at binding sites referred to as cytosine-phosphate-guanine (CpG) islands, which inhibits the binding of transcription factors to DNA and results in changes (typically down regulation) to the expression of genes [24]. The platform Illumina MethylationEPIC Beadchip array is used to measure DNAm levels of roughly 850 K probes, which are ultra high-dimensional. Such high-dimensional DNAm markers may mediate pathways linking environmental exposures with health outcomes. Our objective is to explore the mediating role from high dimensional DNAm markers on the relationship between smoking and lung function in the CARDIA study.
In this paper, we propose an improved estimation and inference procedure for the high-dimensional mediation model, extending the work of Zhang et al. [3]. Our method includes three major steps: First, to tackle the ultra-high dimensionality of the DNAm markers, we screen out potentially a large number of mediators using a series of marginal mediation effect pathways (exposure → mediator → outcome). Second, we adopt the de-biased Lasso method [25] to estimate the high dimensional regression coefficients (mediator → outcome). Third, we employ a joint significance test with a mixture of null distributions to accurately control the FDR for large-scale multiple tests [22].
The remainder of this paper is structured as follows. In "Methodology" Section, we propose a three-step inference procedure for mediation effects in the high-dimensional regression model. In "Simulation studies" Section, we evaluate the performance of our method via numerical simulations. In "Application" Section, an application to the CARDIA study is provided. Finally, some discussion and concluding remarks are presented in "Conclusion and remarks" Section.

Methodology
Denote the exposure as X , baseline covariates to be adjusted for as Z = (Z 1 , . . . , Z q ) T , where the superscript T denotes the transpose of a vector or a matrix. We adopt the following counterfactual framework for the vector of potential mediators M(x) = (M 1 (x), . . . , M p (x)) T under exposure level x , and counterfactual Y (x, m) under exposure level x and mediators level m, to perform the mediation analysis [26]: where γ is the direct effect of exposure on the outcome; β = (β 1 , . . . , β p ) T is the regression parameter vector relating the mediators to the outcome; α = (α 1 , . . . , α p ) T is the parameter vector relating the exposure to mediators; η and δ j are vectors of regression coefficients for the covariates; and ε and e j are error terms in Models (1') and (2'), respectively. Note there are p submodels in Model (2'), one for each mediator. We allow the correlation between the error terms, i.e., e = (e 1 , . . . , e p ) T ∼ N (0, � e ) , where e is a positive definite covariance matrix.
A few causal assumptions that are needed for the identification of natural direct effect (NDE) and natural indirect effects (NIE) are listed below [41][42]: A1. Stable unit treatment value assumption (SUTVA) for both the mediators and the outcome. This assumption means that there is no multiple versions of exposures and there is no interference between individuals, which implies that M(x) and Y (x, m) are well defined.
A2. Consistency for the mediators and the outcome. That is, there are no measurement errors in the mediators and thus the observed variables satisfy M = M(X) and Y = Y (X, M).
A3. Sequential ignorability: This assumption contains 4 parts: (A3.1) X ⊥ Y (x, m)|Z, i.e., no unmeasured confounding between exposure and the potential outcome; (A3.2) M ⊥ Y (x, m)|X, Z, i.e., no unmeasured confounding between mediators and the potential outcome; (A3.3) X ⊥ M(x)|Z, i.e., no unmeasured confounding between exposure and the potential mediators; (A3.4) M x ′ ⊥Y (x, m)|Z, i.e., no exposure-induced confounding between mediators and the potential outcome. In other words, the potential mediators under any intervention level m are independent of potential outcomes under any intervention x and mediator level x ′ given covariate Z.
A4. No direct causal relationship between mediators. We do not allow one mediator to be the cause of another, but we do allow them to have shared common causes.
Under A1-A3, we have direct effect Under the additional assumption A4, we can decompose the indirect effect into sum of indirect effects through each mediator M j , NIE j = α j β j . Also we obtain the structural equation model for the observed outcome as in previous literature [3] to assess the mediation effects of highdimensional mediators: Our goal is to estimate and test the mediation effects α j β j jointly for j = 1, . . . , p. An illustration of mediation analyses with single mediator and high dimensional mediators is given in Fig. 1.
As shown in Fig. 1, we do allow these mediators to share common unmeasured causes. These assumptions are in line with the underlying biologic procedures. Smoking could induce biochemical alterations to the DNAs, which lead to methylation changes. Such change in a certain CpG site is unlikely to directly cause the methylation alternation of other CpG sites. Rather, such dependency is most likely to be indirect, for example, by regulating gene expressions in related pathways that in turn modify other CpGs, or several CpGs are modified by common unmeasured causes (e.g., inflammatory response).
All M j 's are scaled with mean zero and unit variance before performing this screening procedure. The key advantage of Step 1 is that the product term α j β j could roughly describe the mediated effect of the j th mediator. Therefore, the selected subset D contains true mediators with a large probability.
Step 2: (De-biased Lasso). We consider the following submodel based on the selected set D, where β D and M D denote sub-vectors of β and M with index belonging to D respectively, and β D is estimated using the de-biased Lasso method, with estimator β j and its standard error σ β j obtained in [25]. The corresponding p-values are given as: 1 Mediation analysis of A a single mediator; B high dimensional mediators, plotted similarly to [3]. An arrow from X to U is possible though omitted to avoid the complexity in interpreting α as the total effect where �(·) is the cumulative distribution function of N (0, 1). De-biased Lasso in Step 2 is necessary as the ordinary least square will yield inefficient estimates (with reduced power), because the dimension of survived mediators after Step 1 is still relatively large.
Step 3: (Joint Significance Test). We consider the multiple testing problem for jǫD as follows: with corresponding p-value where P β j is given in (6), P α j = 2 1 − � α j / σ α j , α j and σ α j are OLS estimators.
Zhang et al. [3] considered the joint significant test (termed "JS-uniform"), which assumes that P j follows a uniform distribution. However, although P α j and P β j are each uniformly distributed, their maximum is not. As a result, the significance rule using the uniform null distribution for P j results in a valid but overly conservative test [28]. In this paper, we will adopt the "JS-mixture" approach to accurately control the FDR [22] The multiple testing problem (7) is equivalent to the union of the following three disjoint component null hypotheses, That is, P j is a 3-component mixture distribution instead of the uniform distribution. Dai et al. [22] proposed the following estimated FDR for testing mediation: where π 01 , π 10 and π 00 are the estimates of proportions H 01,j , H 10,j and H 00,j , respectively, and We define the significant threshold for P j as t b = sup t : FDR(t) ≤ b , to control the FDR at level b . Then S = j : P j ≤ t b , j ∈ D gives the estimated index set of significant mediators.
We can obtain π 01 , π 10 , π 00 and t b using the R package HDMT [22]. Compared to the estimation and inference method in [3] (termed "HIMA"), our new method (termed "HIMA2") has the following three advantages. First, HIMA only considers β (mediator → outcome) for screening in Step 1, while HIMA2 considers the indirect effect of αβ . Therefore, the mediation-based screening method in HIMA2 addresses the indirect effect more accurately than HIMA. Second, HIMA uses the minimax concave penalty (MCP; [29]) technique to estimate the effect β , which can only provide p-values for selected mediators in Step 2. That is, P β j is set to 1 for those not selected, which results in poor estimate of P j in Eq. (7). In contrast, de-biased Lasso in HIMA2 yields p-values for all β j 's in D , which gives more appropriate estimate of P β j . Third, HIMA adopts a naive joint significance rule assuming a uniform null distribution for the maximum p-value calculation in Step 3, which may result in a valid but overly conservative test with lower power.
We compare the performance of HIMA2 with HIMA in Table 1, which provides the estimated biases (Bias) given by the sample mean of the estimates minus the true value, and the mean-square error (MSE) of the estimates. Table 1 shows that both HIMA2 and HIMA are unbiased, however, HIMA2 has smaller MSEs than HIMA for significant mediators. MSEs for both HIMA and HIMA2 decrease as the sample size increases. Of note, the results for j > 8 ( α j = 0 and β j = 0 ) are close to those of j = 8 and thus omitted.
We also present the estimated FDR and power of mediation effects testing in Tables 2  and 3, where the nominal level is 0.05. The results indicate that both HIMA2 and HIMA can achieve valid FDR control. Furthermore, HIMA2 is more powerful than HIMA in selecting significant mediators, though the differences become smaller when sample size increases. We also note that as the correlation among the mediators becomes larger, both methods suffer in terms of power.
Per suggestion from a reviewer, we compare our method to HDMA [30], which was developed along the lines of HIMA but adopts the de-biased Lasso method in Step 2. However, no multiple testing adjustment was used in HDMA for inference. As a result, HDMA suffers from poor FDR control albeit with higher power as shown in Tables 2  and 3.
Per suggestion from a reviewer, similar to our real data analysis, we also consider a setting with 2 significant mediators, i.e.: β 1 = 0.15, β 2 = 0.3, β 3 = 0.1, β 4 = 0, and β j = 0 for all other j's; α 1 = 0.15, α 2 = 0.3, α 3 = 0, α 4 = 0.1, and α j = 0 for all other j 's. As shown in the supplementary materials (Additional file 1: Tables S1, S2 and S3), we   observe similar results to those in Tables 1, 2 and 3. We note that the results from HIMA and HIMA2 are more close to each other when the correlation is high ( ρ = 0.75). Per suggestion from a reviewer, we use the standardized coefficient estimates in the SIS step, but the results are close to those without standardization (results available upon request).
Finally, we notice that in Tables 2 and Additional file 1: Table S2, the FDR of HIMA2 decreases with sample size. This also happens with HIMA, though to a less magnitude.

Application
We apply our method to the Coronary Artery Risk Development in Young Adults (CAR-DIA) Study, an ongoing longitudinal cohort examining the development and determinants of clinical and subclinical cardiovascular disease and their risk factors [23].   We are interested in investigating how the DNA methylation (DNAm) markers mediate the relation between smoking and lung function. Due to budget limitation, 1200 individuals from the CARDIA participants at Year 15 were randomly selected for DNAm profiling using the Illumina MethylationEPIC Beadchip (p = ~ 850,000 sites). The R package Enmix [31] was used to perform quality control, background correction, dye bias correction, quantile normalization (by probe types), and extreme outliers removal. Eventually, the DNAm measurements were obtained for a total of 1042 blood samples, which are treated as mediators in this study. The FEV1 (forced expiratory volume in 1 s) measured at Year 20 is considered as the lung function outcome. The number of cigarette packs/year in Year 10 is the exposure variable. We are interested in building the mediation pathway in sequence: smoking at Year 10 → High dimensional DNAm markers at Year 15 → lung function at Year 20.
Our analysis adjusts for age, height, weight, study center, gender, and race in Models (1) and (2). Additionally, we estimated the proportions of CD4 + T lymphocytes, CD8 + T lymphocytes, B lymphocytes, natural killer cells, monocytes, and granulocytes using [32], which are also adjusted in the models. To account for experimental batch effects and other technical biases, we derive surrogate variables from intensity data for non-negative internal control probes using principal components (PCs) analysis [31]. The top eight PCs, explaining 95.06% of the variation across the non-negative internal control probes, are also adjusted as covariates in the model. All the covariates are measured at Year 10.
After screening in Step 1, the average of the absolute values of correlation among CpGs is 0.25 (max 0.93). In Table 4, we present the summary results on selected mediators. For FDR < 0.05, HIMA2 identifies 2 CpGs: cg26331243 and cg19862839 as mediators. CpG cg26331243 is located in the body region of gene CCDC33, which is differentially expressed for tobacco smoke exposure [33,34]. CCDC33 is also linked to susceptibility to lung function disorders, e.g., pneumococcal meningitis [35] and SARS-CoV-2 infection [36]. Therefore, it is plausible that cg26331243 plays a role in regulating the expression of CCDC33, which in turn mediates the pathway from smoking to lung function.
CpG cg19862839 is located in the body region of gene TBX4. Growing evidence has indicated that TBX4 variants are associated with a wide spectrum of lung disorders [37,38]. Patients with mutations in TBX4 may also be more susceptible to cigarette smoking [39]. Therefore, we speculate that cg19862839 could participate in regulating the expression of TBX4, which also acts as a mediator between smoking and lung function. In comparison, HIMA only identifies cg26331243 as a mediator with FDR < 0.05. Therefore, the proposed HIMA2 has better power to identify CpGs in high dimensional mediation analysis.
Finally, we note that cg05575921, which was identified in the normative aging study (NAS) [3], is not a significant mediator in the CARDIA study. In CARDIA, the estimate of α (from smoking to DNAm) is highly significant for cg05575921. However, the estimate of β (from DNAm to FEV1) is not significant. This may be due to that participants in CARDIA were much younger (mean age 45 at Year 20, range 38-55) than NAS (mean age 74, range 55-100), when the lung function of CARDIA participants are more homogenous. Therefore, the association between DNAm to lung function at Year 20 may not be significant in CARDIA.
In the current analysis, there is a 5-year gap between the exposure and the mediator. A reviewer raised the concern on treatment-induced-mediator-outcome confounding. The life-course smoking trajectories for the majority of individuals were relatively stable before age 40-45, which corresponds to the Year 10-15 of our study cohort [40]. Although DNA methylation is modifiable by smoking, it is still a relatively stable biomarker over time [41]. Short-term exposure-induced covariates within a 5-year gap (if any) are unlikely to produce biologically functional changes in DNA methylation for us to detect as mediators.

Conclusion and remarks
In this paper we proposed an improved method HIMA2 for high dimensional mediation analysis, which was shown to have better performance than HIMA [3] by numerical studies. We applied HIMA2 to the identification and testing of the DNA methylation mediating effects in the CARDIA study. Our method is relatively simple to implement, and can be widely used in high-dimensional mediation analyses.
Our method can be extended in several directions. First, we will consider how to address the correlation among DNA methylation markers to improve the inferential results, as shown in the Simulation Studies that both HIMA and HIMA2 lose power for high correlation. Second, it is of interest to incorporate the interaction terms between the exposure and the mediators in our model, i.e., the high dimensional moderated mediation analysis. Third, there has been an increasing interest and development in longitudinal studies of DNA methylation. We can also consider repeated measures of DNA methylation markers as mediators in our future research.